Snowflak

Snowflake is a SaaS-based Data Warehouse platform built over AWS,Microsoft Azure, and Google Cloud infrastructure. One of the features behind this software’s popularity with businesses worldwide is its scalability, making it cost-effective. The architecture involves virtual compute instances and efficient storage buckets that run solely on the cloud.It serves as a centralized platform for data management, data lakes, data engineering, data applications development, data science, and secure sharing and consumption of real-time and shared data. Providing a centralized system to consolidate all data, Snowflake revolutionized the data warehousing industry. With Snowflake, you can simplify data warehouse management without sacrificing features.     

Explain Snowflak Architecture

The Snowflake architecture is a hybrid of shared-disk (A common disk or storage device is shared by all computing nodes) and shared-nothing (Each computing node has a private memory and storage space) database architecture in order to combine the best of both. Snowflake utilizes a central data repository for persistent data, which is available to all compute nodes similar to a shared-disk architecture. But, equally, as with shared-nothing architectures, Snowflake uses massively parallel computing (MPP) clusters for query processing, in which each node stores part of the whole data set locally.

All data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units of storage. Each micro-partition contains between 50 MB and 500 MB of uncompressed data (note that the actual size in Snowflake is smaller because data is always stored compressed).Micro-partitioning is automatically performed on all Snowflake tables. Tables are transparently partitioned using the ordering of the data as it is inserted/loaded.

The Snowflake architecture is divided into three key layers as shown below:   

  • Database Storage Layer: Once data has been loaded into Snowflake, this layer reorganizes that data into a specific format like columnar, compressed, and optimized format. The optimized data is stored in cloud storage. Snowflake manages all aspects of how this data is stored — the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are not directly visible nor accessible by customers; they are only accessible through SQL query operations run using Snowflake.
  • Query Processing Layer: In the processing layer, queries are executed using virtual warehouses. Virtual warehouses are independent MPP (Massively Parallel Processing) compute clusters comprised of multiple compute nodes that Snowflake allocates from cloud providers. Due to the fact that virtual warehouses do not share their compute resources with each other, their performance is independent of each other.
  • Cloud Services Layer: It provides services to administer and manage a Snowflake data cloud, such as access control, authentication, metadata management, infrastructure management, query parsing, optimization, and many more.

What do you mean by virtual warehouse?

In Snowflake, a virtual warehouse, often known as a "warehouse," is a collection of computational resources. A virtual warehouse provides the resources required for the users like CPU, memory, and temporary storage to perform multiple Snowflake operations:


Execute the SQL SELECT statements that necessitate the use of computing resources  (e.g. retrieving rows from tables and views).

DML operations include:

Updating table rows (DELETE , INSERT , UPDATE).

Data Loading into tables (COPY INTO <table>).

Data unloading from tables (COPY INTO <location>).


Tuning

  • Data Loading -: Avoid Scanning External Files,Snowflake supports querying data in both ways: managed and external. Data kept on external storage will have slower performance as compared to data managed by Snowflake internally. So, if we can move data to Snowflake we should consider following supported ways by Snowflake for this.
  • Snowpipe – Enables loading data from files as soon as they’re available in a stage. This means you can load data from files in micro-batches, making it available to users within minutes, rather than manually executing COPY statements on a schedule to load larger batches. So, Snowpipe helps in quickly creating Managed Table, and not to be confused with External Table.
Data Design/Transformation

Clustering Keys
For very large tables, typically over a terabyte in size, designers should consider defining a cluster key to maximize query performance. Using a cluster key maximizes partition elimination, and therefore improves query performance.

To illustrate the effect of Snowflake Clustering, consider the diagram below in which data is frequently filtered in the query WHERE clause by DATE.


As data is loaded by date, it tends to be naturally clustered, with all data for the same day falling into the same micro-partition. However, if the following SQL is executed, Snowflake will attempt to keep all sale dates in the same micro-partition. When needed, a background task will automatically re-cluster the data, and this compute processing will be charged as a separate item.

ALTER TABLE sales CLUSTER BY (sales_date);

Creating one or more materialised views (clustered or unclustered)
A materialized view is a pre-computed data set derived from a query specification (the SELECT in the view definition) and stored for later use. Because the data is pre-computed, querying a materialized view is faster than executing a query against the base table of the view. This performance difference can be significant when a query is run frequently or is sufficiently complex. As a result, materialized views can speed up expensive aggregation, projection, and selection operations, especially those that run frequently and that run on large data sets.
create materialized view v1 as
select * from table1 where column_1 between 100 and 400;
Deciding When to Create a Materialized View or a Regular View
In general, when deciding whether to create a materialized view or a regular view, use the following criteria:

Create a materialized view when all of the following are true:
  • The query results from the view don’t change often. This almost always means that the underlying/base table for the view doesn’t change often, or at least that the subset of base table rows used in the materialized view don’t change often.
  • The results of the view are used often (typically significantly more often than the query results change).
  • The query consumes a lot of resources. Typically, this means that the query consumes a lot of processing time or credits, but it could also mean that the query consumes a lot of storage space for intermediate results.
Create a regular view when any of the following are true:
The results of the view change often.
  • The results are not used often (relative to the rate at which the results change).
  • The query is not resource intensive so it is not costly to re-run it.
Materialized Views and Clustering
https://dataforgeeks.com/snowflake-performance-tuning-and-best-practices/2338/


What are the types of tables in Snowflake?
Temporary Tables:

Snowflake supports creating temporary tables to store transient, non-permanent data.Temporary tables exist only within the session. They are created and persist only for the session remainder. They are not visible to other sessions or users and don’t support standard features like cloning. Therefore the data stored in the system is cleaned entirely and is not recoverable either by the user-created table or Snowflake.

To create a temporary table, specify the TEMPORARY keyword in CREATE TABLE.

Transient Tables:
Snowflake supports creating Transient tables that continue until dropped explicitly and are available to all the users with the relevant privileges.To maintain transitory data beyond each session, transient tables are designed. Temporary tables are similar to permanent tables with the vital difference in their absence of a Fail-safe period.

To create a temporary table:
After creation, transient tables cannot be converted to another table type.

Permanent Tables:
Permanent tables have a Fail-safe period similar to transient tables and provide additional security of data recovery and protection. Presently, the permanent table cannot be modified to Transient Table using ALTER TABLE command. Property of TRANSIENT is set at the creation of a table and cannot be customized. In the same way, it isn’t possible to change directly a transient table to a permanent table.

In order to convert an existing transient table to a permanent table (or vice versa) through protecting data and other characteristics such as granted privileges and column defaults, you can create a new table and use the COPY GRANTS clause, then copy the data.

Let’s understand with a few examples. 

Explain Snowflake Time Travel
Snowflake Time Travel tool allows us to access the past data at any moment in the specified period. Through this, we can see the data that we can change or delete. Through this tool, we can carry out the following tasks:
  • Restore the data-associated objects that may have lost unintentionally.
  • For examining the data utilization and changes done to the data in a specific time period.
  • Duplicating and backing up the data from the essential points in history.
What is Data Retention Period in Snowflake?

The data retention period is an important aspect of Snowflake Time Travel.
When data in a table is modified, such as deletion or discarding an object holding data, Snowflake saves the data's previous state. The data retention period determines the number of days that this historical data is kept and, as a result, Time Travel operations (SELECT, CREATE... CLONE, UNDROP) can be performed on it.

The standard retention period is one day (24 hours) and is enabled by default for all Snowflake accounts.








No comments:

Post a Comment